Conversation

@bw4sz
Collaborator

@bw4sz bw4sz commented Dec 30, 2025

  • This PR favors trainer.validate over main.evaluate since it closely matches PyTorch Lightning. I considered renaming main.evaluate to match, or deprecating it to hide it from users; I welcome input on that.
  • We want one standard evaluation path, then assert that its behavior matches the inference path. I started down this road by adding tests to check parity between main.evaluate and trainer.validate.
  • I also simplified the outputs of trainer.validate; if this is going to be the primary evaluation vehicle, it has too many mAP products for the average user.
  • I removed the size and batch size arguments from main.evaluate and routed them through the config; in general we want the config to control as much as possible, as we discussed in #1240 (Image resizing and performance during training and validation).
  • I simplified the prediction dataset workflow to accept lists instead of tensors. This is faster, cleaner, and easier to read. There is one additional complication for the MultiImage dataset: since the batch comes from several images, we need to keep track of which image each patch came from in order to put it all back together.
  • I wrote several tests to assert consistency of behavior between evaluate and predict, making the whole system more cohesive.
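For intuition on the MultiImage bookkeeping mentioned above, here is a minimal sketch. The function name and the `sublist_lengths` variable are illustrative assumptions, not the actual DeepForest implementation: a MultiImage batch flattens patches from several images into one list, so we record how many patches each image contributed and use those lengths to regroup the flattened predictions.

```python
def regroup_by_image(flat_predictions, sublist_lengths):
    """Split a flat list of per-patch predictions back into per-image groups.

    flat_predictions: predictions, one per patch, in batch order.
    sublist_lengths: number of patches each source image contributed.
    """
    grouped, offset = [], 0
    for length in sublist_lengths:
        grouped.append(flat_predictions[offset:offset + length])
        offset += length
    return grouped

# Example: two images contributing 4 and 3 patches respectively.
preds = [f"p{i}" for i in range(7)]
groups = regroup_by_image(preds, [4, 3])
# groups[0] holds the first image's 4 patches, groups[1] the second's 3.
```

The running `offset` is what ties each slice of the flat batch back to its source image.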

AI-Assisted Development

  • [x] I used AI tools (e.g., GitHub Copilot, ChatGPT, etc.) in developing this PR
  • [x] I understand all the code I'm submitting
  • [x] I have reviewed and validated all AI-generated code

AI tools used (if applicable):

I used Cursor planning mode to help structure the tests, which I then edited and simplified.


Note

Major refactor to streamline augmentation, evaluation, prediction, and I/O with extensive tests.

  • Migrate augmentations to kornia (remove albumentations), add ZoomBlur and RandomPadTo, and switch dataset transforms to AugmentationSequential
  • Standardize evaluation via trainer.validate with simplified mAP logging; deprecate main.evaluate (internal __evaluate__ retained)
  • Simplify prediction datasets: use list-based batches, track sub-batch/window indices for MultiImage and tiled rasters, and update predict/predict_tile postprocessing
  • Overhaul utilities.read_file and geospatial handling with DeepForest_DataFrame, explicit image_path/label/root_dir assignment, and improved COCO/shape conversions
  • Update visualization to infer dimensions from image/root_dir (remove explicit width/height), and harden callbacks for empty annotations
  • Config/schema: add log_root; crop model: add bbox expand and dataset support
  • Adjust datasets (training, prediction, cropmodel) for new transforms, normalization, and box filtering; ensure 3-channel checks
  • Update docs (user guide, HISTORY), README link fix, and add kornia dependency; tweak codecov.yml to mark patch status informational
  • Add comprehensive tests for augmentations, datasets, callbacks, CLI, evaluation parity, crop model, DETR, and prediction batching

Written by Cursor Bugbot for commit 71728ea. This will update automatically on new commits.

@cursor cursor bot left a comment


This PR is being reviewed by Cursor Bugbot


@bw4sz bw4sz added this to the DeepForest 2.1 milestone Jan 7, 2026
if len(self.sublist_lengths) > 0:
    batch_sublist_lengths = self.sublist_lengths[batch_idx]
    for idx, sub_idx in batch_sublist_lengths:
        result = self.format_batch(batch[sub_idx], idx, sub_idx)

Wrong predictions accessed due to reset sub_idx in flattened batch

High Severity

In MultiImage.postprocess, batch[sub_idx] indexes into the flattened prediction results using sub_idx, which resets to 0 for each image in the batch (as stored in sublist_lengths). This causes predictions for the second and subsequent images to incorrectly retrieve prediction results from the first image. For example, with 2 images having 4 patches each, when processing the second image (idx=1), sub_idx values 0,1,2,3 cause batch indices 0–3 to be accessed instead of 4–7. A running index into the flattened batch is needed instead.
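The running-index fix Bugbot describes can be sketched as follows. This is a minimal illustration with hypothetical names, not the actual DeepForest code: instead of indexing the flattened batch with the per-image `sub_idx` (which restarts at 0 for every image), keep a separate counter that only ever advances.

```python
def postprocess_multiimage(batch, index_pairs):
    """Walk a flattened batch of per-patch predictions.

    batch: flat list of predictions for all patches of all images.
    index_pairs: (image_idx, sub_idx) for each patch; sub_idx resets to 0
        for every new image, so it cannot be used to index `batch` directly.
    """
    results = []
    flat_idx = 0  # running position in the flattened prediction list
    for image_idx, sub_idx in index_pairs:
        prediction = batch[flat_idx]  # NOT batch[sub_idx]
        results.append((image_idx, sub_idx, prediction))
        flat_idx += 1
    return results

# Two images with 2 patches each: sub_idx restarts at 0 for image 1,
# but flat_idx keeps advancing, so image 1 reads batch[2] and batch[3].
pairs = [(0, 0), (0, 1), (1, 0), (1, 1)]
out = postprocess_multiimage(["a", "b", "c", "d"], pairs)
```

With the buggy `batch[sub_idx]` indexing, the second image would re-read `"a"` and `"b"`; the running index reads `"c"` and `"d"` as intended.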

Additional Locations (1)


@bw4sz bw4sz changed the title [WIP] Simplify evaluation and create tests to assert expected eval behavior Simplify evaluation and prediction to mirror training datasets and create tests to assert expected eval behavior Jan 12, 2026
@bw4sz bw4sz self-assigned this Jan 12, 2026
Collaborator

@jveitchmichaelis jveitchmichaelis left a comment


Only a few comments for readability, plus this needs a rebase, but looks good. I think using Lightning-y methods is the right direction to avoid confusion. Once this is in, I'll rebase #1256, which should make this even tidier.

@bw4sz
Collaborator Author

bw4sz commented Jan 16, 2026

@jveitchmichaelis I can rebase if you are ready.

@bw4sz bw4sz force-pushed the simplify_evaluate branch 6 times, most recently from 71728ea to ffcd713 Compare January 16, 2026 18:24
@bw4sz
Collaborator Author

bw4sz commented Jan 16, 2026

@jveitchmichaelis , @ethanwhite and I just discussed this. In this case, when you are ready, git squash and merge and we will just eat these history changes. I have learned my lesson about how to do this correctly in the future.

Member

@ethanwhite ethanwhite left a comment


Flagging to stop merge while I finish understanding the history complexities here

- Make trainer.validate the preferred evaluation method and standardize train, eval, and predict to accept lists, not batches.
- Add a sublist concept for MultiImage datasets.
- Ensure predict_file follows the root_dir criteria for read_file.
paths (List[str]): A list of image paths.
patch_size (int): Size of the patches to extract.
patch_overlap (float): Overlap between patches.
size (int): Target size to resize images to. Optional; if not provided, no resizing is performed.
Collaborator

@jveitchmichaelis jveitchmichaelis Jan 18, 2026


Remove unused arg in docstring L30 + reference at L21

@jveitchmichaelis
Collaborator

@bw4sz re-reading this in light of our discussion on Thursday to make sure I'm clear. Is the idea that image size for prediction is solely set via patch_size + overlap?

@ethanwhite if you're happy, could you approve changes to lift the merge block, please? Then I'll merge.

Collaborator

@jveitchmichaelis jveitchmichaelis left a comment


Looks good to me. We should probably aim to improve our documentation on how sizes flow through the model as well, but not critical right now.

@bw4sz
Collaborator Author

bw4sz commented Jan 19, 2026

> @bw4sz re-reading this in light of our discussion on Thursday to make sure I'm clear. Is the idea that image size for prediction is solely set via patch_size + overlap?
>
> @ethanwhite if you're happy, could you approve changes to lift the merge block, please? Then I'll merge.

Yes, that's right, aside from the internal RetinaNet resizing, which still applies for that model.
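For intuition on how patch_size and patch_overlap alone determine the prediction windows, here is an illustrative one-dimensional sketch. This is not DeepForest's actual windowing code (which uses its own tiling utilities); the function name and edge-handling choice are assumptions for the example.

```python
def window_origins(image_dim, patch_size, patch_overlap):
    """Return start offsets of tiles along one image dimension.

    patch_overlap is a fraction (e.g. 0.1 -> 10% overlap), so the
    stride between consecutive windows is patch_size * (1 - overlap).
    """
    stride = max(1, int(patch_size * (1 - patch_overlap)))
    origins = list(range(0, max(image_dim - patch_size, 0) + 1, stride))
    # Ensure the right/bottom edge is covered by a final window.
    last = image_dim - patch_size
    if last > 0 and origins[-1] != last:
        origins.append(last)
    return origins

# 1000 px image, 400 px patches, 10% overlap -> 360 px stride.
print(window_origins(1000, 400, 0.1))  # -> [0, 360, 600]
```

Under this scheme the effective image size seen by the model is always patch_size, which matches the idea that no separate size argument is needed for prediction.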

@jveitchmichaelis jveitchmichaelis merged commit 7010aeb into main Jan 20, 2026
9 checks passed
@jveitchmichaelis jveitchmichaelis deleted the simplify_evaluate branch January 20, 2026 16:41
